2022-10-27
Perhaps the most popular data science methodologies come from the field of machine learning.
Machine learning success stories include the handwritten zip code readers implemented by the postal service, speech recognition technology such as Apple’s Siri, movie recommendation systems, spam and malware detectors, housing price predictors, and driverless cars.
In machine learning, data comes in the form of:

1. the outcome we want to predict, and
2. the features that we will use to predict the outcome.
We want to build an algorithm that takes feature values as input and returns a prediction for the outcome when we don’t know the outcome.
The machine learning approach is to train an algorithm using a dataset for which we do know the outcome, and then apply this algorithm in the future to make a prediction when we don’t know the outcome.
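The train-then-apply idea described above can be sketched in a few lines. This is a minimal toy illustration, not a method from the text: the data, the midpoint rule, and the `predict` function are all hypothetical choices made only to show the two stages.

```python
# Stage 1: "train" using data for which we DO know the outcome.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]  # feature values (toy data)
train_y = [0, 0, 0, 1, 1]            # known outcomes

# A very simple training rule (hypothetical): set the cutoff at the
# midpoint between the feature means of the two classes.
mean_0 = sum(x for x, y in zip(train_x, train_y) if y == 0) / train_y.count(0)
mean_1 = sum(x for x, y in zip(train_x, train_y) if y == 1) / train_y.count(1)
cutoff = (mean_0 + mean_1) / 2

# Stage 2: apply the trained rule to a feature value whose outcome
# we do NOT know.
def predict(x):
    return 1 if x > cutoff else 0

print(predict(4.6))  # prediction for a new, unlabeled observation
```

The point is the structure, not the rule itself: training only touches data with known outcomes, and prediction uses nothing but the feature values.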
Here we will use \(Y\) to denote the outcome and \(X_1, \dots, X_p\) to denote features.
Note that features are sometimes referred to as predictors or covariates.
We consider all these to be synonyms.
Prediction problems can be divided into those with categorical outcomes and those with continuous outcomes.
For categorical outcomes, \(Y\) can be any one of \(K\) classes.
The number of classes can vary greatly across applications.
For example, in the digit reader data, \(K=10\) with the classes being the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
In speech recognition, the outcomes are all possible words or phrases we are trying to detect.
Spam detection has two outcomes: spam or not spam.
In this book, we denote the \(K\) categories with indexes \(k=1,\dots,K\).
However, for binary data we will use \(k=0,1\) for mathematical conveniences that we demonstrate later.
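One such convenience of the \(0,1\) coding, shown here as an illustrative aside rather than the demonstration the text promises later, is that the average of the outcomes is simply the proportion of observations in class 1:

```python
# With binary outcomes coded as 0/1 (toy data), the mean of y is the
# proportion of observations in class 1.
y = [0, 1, 1, 0, 1]
proportion_class_1 = sum(y) / len(y)
print(proportion_class_1)  # 0.6
```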
The general setup is as follows.
We have a series of features and an unknown outcome we want to predict:
| outcome | feature 1 | feature 2 | feature 3 | feature 4 | feature 5 |
|---|---|---|---|---|---|
| ? | \(X_1\) | \(X_2\) | \(X_3\) | \(X_4\) | \(X_5\) |
To build a model, we collect data for which the outcome is known:

| outcome | feature 1 | feature 2 | feature 3 | feature 4 | feature 5 |
|---|---|---|---|---|---|
| \(y_{1}\) | \(x_{1,1}\) | \(x_{1,2}\) | \(x_{1,3}\) | \(x_{1,4}\) | \(x_{1,5}\) |
| \(y_{2}\) | \(x_{2,1}\) | \(x_{2,2}\) | \(x_{2,3}\) | \(x_{2,4}\) | \(x_{2,5}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(y_n\) | \(x_{n,1}\) | \(x_{n,2}\) | \(x_{n,3}\) | \(x_{n,4}\) | \(x_{n,5}\) |
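In code, the tabular setup above is often stored as an outcome vector of length \(n\) together with an \(n \times p\) feature matrix. A minimal sketch with toy values (the numbers are placeholders, not data from the text):

```python
# Outcome vector y (length n) and feature matrix x (n rows, p columns),
# mirroring the table layout: row i holds y_i and x_{i,1}, ..., x_{i,p}.
n, p = 3, 5
y = [3.2, 1.8, 2.5]
x = [
    [0.1, 0.2, 0.3, 0.4, 0.5],   # features for observation 1
    [1.1, 1.2, 1.3, 1.4, 1.5],   # features for observation 2
    [2.1, 2.2, 2.3, 2.4, 2.5],   # features for observation 3
]
assert len(y) == n and all(len(row) == p for row in x)
```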
When the outcome is continuous we refer to the machine learning task as prediction, and the main output of the model is a function \(f\) that automatically produces a prediction, denoted with \(\hat{y}\), for any set of predictors: \(\hat{y} = f(x_1, x_2, \dots, x_p)\).
We use the term actual outcome to denote what we ended up observing.
So we want the prediction \(\hat{y}\) to match the actual outcome \(y\) as well as possible.
Because our outcome is continuous, a prediction \(\hat{y}\) will not be exactly right or wrong; instead, we determine an error, defined as the difference between the actual outcome and the prediction, \(y - \hat{y}\).
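A quick sketch of this error computation, using toy actual outcomes and predictions:

```python
# For a continuous outcome, each prediction carries an error y - y_hat
# rather than being simply right or wrong (toy values).
y     = [3.0, 1.5, 2.0]   # actual outcomes
y_hat = [2.5, 2.0, 2.0]   # predictions
errors = [a - b for a, b in zip(y, y_hat)]
print(errors)  # [0.5, -0.5, 0.0]
```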
When the outcome is categorical, we refer to the machine learning task as classification, and the main output of the model will be a decision rule which prescribes which of the \(K\) classes we should predict.
In this scenario, most models provide functions of the predictors for each class \(k\), \(f_k(x_1, x_2, \dots, x_p)\), that are used to make this decision.
When the data is binary, a typical decision rule looks like this: if \(f_1(x_1, x_2, \dots, x_p) > C\), predict category 1; otherwise, predict the other category, with \(C\) a predetermined cutoff.
Because the outcomes are categorical, our predictions will be either right or wrong.
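The binary decision rule described above can be sketched as follows. The score function `f1` and the cutoff `C` here are hypothetical placeholders, chosen only to make the rule concrete:

```python
# Predict class 1 when f_1(x) exceeds a predetermined cutoff C.
C = 0.5

def f1(x1, x2):
    # hypothetical score function for class 1 (not from the text)
    return 0.3 * x1 + 0.4 * x2

def classify(x1, x2):
    return 1 if f1(x1, x2) > C else 0

print(classify(1.0, 1.0))  # score 0.7  > 0.5 -> predicts class 1
print(classify(0.5, 0.5))  # score 0.35 <= 0.5 -> predicts the other class
```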
Note that this terminology varies among courses, textbooks, and other publications.
Often the term prediction is used for both categorical and continuous outcomes, and the term regression can be used for the continuous case.
Here we avoid the term regression to prevent confusion with our earlier use of the term linear regression.
In most cases it will be clear if our outcomes are categorical or continuous, so we will avoid using these terms when possible.
Let’s consider the zip code reader example.
The first step in handling mail received in the post office is sorting letters by zip code: